Introduction

This notebook presents an analysis of data on 17,007 strategy games available on the Apple App Store, such as Clash of Clans, Plants vs. Zombies, and Pokemon GO. The dataset was acquired from Kaggle.com; it was collected on the 3rd of August 2019 using the iTunes API.

With this dataset, we may be able to analyze which factors make a successful game.

Loading the dataset

To start this analysis, we first load the required packages (tidyverse, readr, and DT) and read the CSV file provided by Kaggle.

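# require() returns FALSE when a package is missing, so install it and then load it.
if(!require(tidyverse)){install.packages("tidyverse"); library(tidyverse)}
if(!require(readr)){install.packages("readr"); library(readr)}
if(!require(DT)){install.packages("DT"); library(DT)}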
options(scipen=10000)

appstoreGamesFile = "data/appstore_games.csv"
appstoreGamesDF = read_csv(appstoreGamesFile) %>% rename_all(~str_replace_all(., "\\s+", ""))
summary(appstoreGamesDF)
##      URL                  ID                 Name             Subtitle        
##  Length:17007       Min.   : 284921427   Length:17007       Length:17007      
##  Class :character   1st Qu.: 899654330   Class :character   Class :character  
##  Mode  :character   Median :1112286228   Mode  :character   Mode  :character  
##                     Mean   :1059613815                                        
##                     3rd Qu.:1286982837                                        
##                     Max.   :1475076711                                        
##                                                                               
##    IconURL          AverageUserRating UserRatingCount       Price         
##  Length:17007       Min.   :1.000     Min.   :      5   Min.   :  0.0000  
##  Class :character   1st Qu.:3.500     1st Qu.:     12   1st Qu.:  0.0000  
##  Mode  :character   Median :4.500     Median :     46   Median :  0.0000  
##                     Mean   :4.061     Mean   :   3306   Mean   :  0.8134  
##                     3rd Qu.:4.500     3rd Qu.:    309   3rd Qu.:  0.0000  
##                     Max.   :5.000     Max.   :3032734   Max.   :179.9900  
##                     NA's   :9446      NA's   :9446      NA's   :24        
##  In-appPurchases    Description         Developer          AgeRating        
##  Length:17007       Length:17007       Length:17007       Length:17007      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Languages              Size            PrimaryGenre          Genres         
##  Length:17007       Min.   :     51328   Length:17007       Length:17007      
##  Class :character   1st Qu.:  22950144   Class :character   Class :character  
##  Mode  :character   Median :  56768954   Mode  :character   Mode  :character  
##                     Mean   : 115706430                                        
##                     3rd Qu.: 133027072                                        
##                     Max.   :4005591040                                        
##                     NA's   :1                                                 
##  OriginalReleaseDate CurrentVersionReleaseDate
##  Length:17007        Length:17007             
##  Class :character    Class :character         
##  Mode  :character    Mode  :character         
##                                               
##                                               
##                                               
## 

As the summary shows, there are 18 columns in this dataset.

We need to fix the typing of some columns, such as the release dates.

fixedAppstoreGamesDF <- appstoreGamesDF %>%
  mutate(OriginalReleaseDate = as.Date(OriginalReleaseDate, "%d/%m/%Y")) %>%
  mutate(CurrentVersionReleaseDate = as.Date(CurrentVersionReleaseDate, "%d/%m/%Y")) %>%
  mutate(AgeRating = factor(AgeRating, levels=c('4+','9+', '12+', '17+')))
  
appstoreGamesDF <- fixedAppstoreGamesDF
datatable(appstoreGamesDF %>% select(-URL, -ID, -Subtitle, -IconURL, -Description, -Developer))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
summary(appstoreGamesDF)
##      URL                  ID                 Name             Subtitle        
##  Length:17007       Min.   : 284921427   Length:17007       Length:17007      
##  Class :character   1st Qu.: 899654330   Class :character   Class :character  
##  Mode  :character   Median :1112286228   Mode  :character   Mode  :character  
##                     Mean   :1059613815                                        
##                     3rd Qu.:1286982837                                        
##                     Max.   :1475076711                                        
##                                                                               
##    IconURL          AverageUserRating UserRatingCount       Price         
##  Length:17007       Min.   :1.000     Min.   :      5   Min.   :  0.0000  
##  Class :character   1st Qu.:3.500     1st Qu.:     12   1st Qu.:  0.0000  
##  Mode  :character   Median :4.500     Median :     46   Median :  0.0000  
##                     Mean   :4.061     Mean   :   3306   Mean   :  0.8134  
##                     3rd Qu.:4.500     3rd Qu.:    309   3rd Qu.:  0.0000  
##                     Max.   :5.000     Max.   :3032734   Max.   :179.9900  
##                     NA's   :9446      NA's   :9446      NA's   :24        
##  In-appPurchases    Description         Developer         AgeRating  
##  Length:17007       Length:17007       Length:17007       4+ :11806  
##  Class :character   Class :character   Class :character   9+ : 2481  
##  Mode  :character   Mode  :character   Mode  :character   12+: 2055  
##                                                           17+:  665  
##                                                                      
##                                                                      
##                                                                      
##   Languages              Size            PrimaryGenre          Genres         
##  Length:17007       Min.   :     51328   Length:17007       Length:17007      
##  Class :character   1st Qu.:  22950144   Class :character   Class :character  
##  Mode  :character   Median :  56768954   Mode  :character   Mode  :character  
##                     Mean   : 115706430                                        
##                     3rd Qu.: 133027072                                        
##                     Max.   :4005591040                                        
##                     NA's   :1                                                 
##  OriginalReleaseDate  CurrentVersionReleaseDate
##  Min.   :2008-07-11   Min.   :2008-08-01       
##  1st Qu.:2014-09-23   1st Qu.:2016-04-17       
##  Median :2016-07-09   Median :2017-07-24       
##  Mean   :2016-03-04   Mean   :2017-04-26       
##  3rd Qu.:2017-12-07   3rd Qu.:2018-11-19       
##  Max.   :2019-10-26   Max.   :2019-10-26       
## 
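
As a quick sanity check on this kind of conversion, the sketch below uses a few made-up date strings (not rows from the dataset) to show why the format string matters: a string that cannot be parsed with the given format becomes NA rather than raising an error, so counting NAs after the conversion is a cheap safeguard. The same applies to factor(), where any value outside the declared levels silently becomes NA. The value "18+" is a hypothetical label used only to trigger that case.

```r
# Hypothetical sample strings; the real values live in appstore_games.csv.
raw_dates <- c("11/07/2008", "26/10/2019", "31/12/2015")

# Day-first format, as used in the notebook.
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Month-first format: strings whose first field exceeds 12 cannot be parsed.
misparsed <- as.Date(raw_dates, format = "%m/%d/%Y")

n_failed_day_first   <- sum(is.na(parsed))
n_failed_month_first <- sum(is.na(misparsed))

# Values outside the declared levels become NA in a factor ("18+" is made up).
age <- factor(c("4+", "17+", "9+", "18+"), levels = c("4+", "9+", "12+", "17+"))
n_unknown_age <- sum(is.na(age))

print(c(n_failed_day_first, n_failed_month_first, n_unknown_age))
```

Running the same NA counts on the real columns before and after the mutate() calls would confirm that no rows were lost in the conversion.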

Univariate Plots

Right now I have no hypotheses to check, but let's create some plots to see the current state of the games released on the App Store.

Number of games released each year

First, the number of games released each year. We can see by the plot that the number of games released had been increasing up until 2016. 2017 and 2018 had fewer games released. 2019 is not yet over, so it may catch up to the previous years.

 appstoreGamesDF %>%
  select(OriginalReleaseDate) %>%
  mutate(OriginalReleaseYear = format(OriginalReleaseDate, "%Y")) %>%
  group_by(OriginalReleaseYear) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=OriginalReleaseYear, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    ylab("Number of games released") +
    xlab("Release Year") +
    theme_minimal()

Number of games with the last update released each year

Plotting the release year of the current version of the games doesn’t give us much information. At best, we can see that the majority of the games have had an update in the last 4 years.

 appstoreGamesDF %>%
  select(CurrentVersionReleaseDate) %>%
  mutate(CurrentVersionRelease = format(CurrentVersionReleaseDate, "%Y")) %>%
  group_by(CurrentVersionRelease) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=CurrentVersionRelease, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

Number of games per user rating.

In this plot, we can see that the number of games per possible rating score increases in a curved fashion up until the 4.5 score. The perfect 5 score is much less common than 4 and 4.5.

unique(appstoreGamesDF$AverageUserRating)
##  [1] 4.0 3.5 3.0 2.5  NA 2.0 4.5 1.5 5.0 1.0
 appstoreGamesDF %>%
  select(AverageUserRating) %>%
  filter(!is.na(AverageUserRating)) %>%
  group_by(AverageUserRating) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=AverageUserRating, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) +
    scale_x_continuous(breaks = seq(1,5,by=0.5)) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

Number of games per Age Rating.

This plot is simple: it shows that games on the App Store tend to target all ages. Adult games (17+) are rare.

 appstoreGamesDF %>%
  select(AgeRating) %>% 
  arrange(AgeRating) %>%
  group_by(AgeRating) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=AgeRating, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

Number of Games per Language.

A game on the App Store may be localized in multiple languages. According to this plot, the two most popular languages for games are English (EN) and Chinese (ZH). This likely reflects the language proficiency of the user base.

 appstoreGamesDF %>%
  select(ID, Languages) %>%
  separate_rows(Languages, sep=",") %>%
  drop_na(Languages) %>%
  group_by(Languages) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
    ggplot(aes(x=reorder(Languages,desc(count)), y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()
## Selecting by count
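
One detail worth checking when splitting a multi-valued column is the exact separator. The base R sketch below uses made-up strings and assumes, purely for illustration, that values are separated by a comma followed by a space. Under that assumption, splitting on "," alone leaves a leading space on every value after the first, so " ZH" and "ZH" would be counted as different languages; splitting on the pattern ",\\s*" avoids this. Whether the real Languages column has this shape should be verified by inspecting it first.

```r
# Hypothetical language strings; the ", " separator is an assumption.
langs <- c("EN, ZH, JA", "EN", "ZH, EN", NA)
langs <- langs[!is.na(langs)]

# Splitting on the bare comma keeps the leading spaces.
tokens_plain <- unlist(strsplit(langs, ","))

# Splitting on a comma plus optional whitespace yields clean codes.
tokens_clean <- unlist(strsplit(langs, ",\\s*"))

lang_counts <- sort(table(tokens_clean), decreasing = TRUE)
print(unique(tokens_plain))
print(lang_counts)
```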

Number of Games per Genre

Similar to languages, a game may have multiple genres. The two most common genres are “Strategy” and “Games”, which is natural since the dataset we are analyzing is about “Strategy Games”. Many games are also classified as “Entertainment”, which is not a game genre. The actual most popular game genre in this dataset is “Puzzle”.

 appstoreGamesDF %>%
  select(ID, Genres) %>%
  separate_rows(Genres, sep=",") %>%
  drop_na(Genres) %>%
  group_by(Genres) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
    ggplot(aes(x=reorder(Genres,desc(count)), y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=90,vjust= 0.2,hjust=1))
## Selecting by count

Number of free/not-free games

The majority of games on the App Store are free to play, as seen in this plot.

  appstoreGamesDF %>%
  select(Price) %>%
  drop_na(Price) %>%
  mutate(PriceRange = case_when(Price <= 0 ~ "Free",
                                TRUE ~ "Not Free"))%>%
  mutate(PriceRange = factor(PriceRange, levels= c("Free", "Not Free"))) %>%
  group_by(PriceRange) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=PriceRange, y=count))+
  geom_col()+
  geom_text(aes(label=count), vjust=-0.25, size=3.5) +
  scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
  theme_minimal()
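
The same split can be expressed as a share rather than a raw count. The sketch below uses a handful of made-up prices (not dataset rows) and mirrors the `Price <= 0` rule from the code above; rows with a missing price are dropped by table().

```r
# Hypothetical prices, including one missing value.
prices <- c(0, 0, 0, 0.99, NA, 4.99, 0, 0)

# Same rule as the case_when() above: a price of zero or less counts as free.
price_range <- ifelse(prices <= 0, "Free", "Not Free")

# table() ignores NA by default, so the shares are over games with a known price.
shares <- prop.table(table(price_range))
print(round(shares, 3))
```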

Other plots

Finally, I also tried to plot a histogram of the number of games by number of user ratings. However, the histogram is severely unbalanced: most games have very few user ratings. The code below only includes games with at least 10,000 user ratings.

  appstoreGamesDF %>%
  select(UserRatingCount) %>%
  filter(!is.na(UserRatingCount))%>%
  filter(UserRatingCount>=10000)%>%
  arrange(UserRatingCount) %>%
    ggplot(aes(x=UserRatingCount)) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    geom_histogram(bins=10) +
    theme_minimal() +
    labs(x="Total Number of User Ratings", y="Number of Games")
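
A common alternative for such a skewed variable, not used in the analysis above, is to bin the values on a logarithmic scale. The sketch below illustrates the effect on simulated counts; the log-normal draw and its parameters are assumptions chosen only to produce a long right tail, not a model of the real data.

```r
set.seed(1)
# Hypothetical, heavily right-skewed rating counts.
rating_counts <- round(rlnorm(1000, meanlog = 4, sdlog = 2)) + 5

# Bin the raw values and the log10-transformed values without drawing anything.
raw_bins <- hist(rating_counts, breaks = 10, plot = FALSE)
log_bins <- hist(log10(rating_counts), breaks = 10, plot = FALSE)

# Share of games that land in the single fullest bin on each scale.
raw_peak_share <- max(raw_bins$counts) / length(rating_counts)
log_peak_share <- max(log_bins$counts) / length(rating_counts)
print(c(raw = raw_peak_share, log10 = log_peak_share))
```

In a ggplot2 pipeline like the one above, adding scale_x_log10() to the histogram would have a similar effect, without filtering out the low-count games.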

Multivariate plots

Questions

Unfortunately, there is no information regarding the revenue these games make. We can only speculate that any user who reviews a non-free game has bought it at least once. Thus, we can build a crude model of how much money a game has made compared to others. Of course, this does not account for in-app purchases; games with in-app purchases are not only the most common type of game on the App Store, but, according to news reports, also the ones that usually make the most money in mobile gaming.

With this crude model, we can relate how most variables impact the revenue of a game: e.g., the number of languages, a specific language, the genres, the release date, the age rating, the app size, and maybe others.
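
A minimal sketch of that crude model, under the stated assumption that every rating corresponds to at least one purchase: multiply the price by the number of user ratings to get a lower-bound revenue proxy for paid games. The rows and the column name EstimatedMinRevenue are hypothetical; only Price and UserRatingCount come from the dataset.

```r
# Hypothetical games; free games and games without ratings are excluded.
games <- data.frame(
  Name = c("A", "B", "C", "D"),
  Price = c(0, 4.99, 0.99, 2.99),
  UserRatingCount = c(120000, 300, 5000, NA),
  stringsAsFactors = FALSE
)

paid <- subset(games, Price > 0 & !is.na(UserRatingCount))

# Lower-bound proxy: each rating is assumed to be one purchase at the listed price.
paid$EstimatedMinRevenue <- paid$Price * paid$UserRatingCount

# Rank paid games by the proxy, highest first.
paid <- paid[order(-paid$EstimatedMinRevenue), ]
print(paid)
```

The proxy is only useful for ranking paid games against each other; it says nothing about free games or in-app purchase income.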

Is there a correlation between age rating and the number of languages available?

Unfortunately, it is quite difficult to visualize the distribution between these two variables. The quartile values are very close to one another, so the boxplots are not really useful. The distribution of languages is better shown by plotting a point for each game with some jitter. It was also necessary to crop the y-scale, as some games support more than 90 languages.

  # appstoreGamesDF %>%
  #  select(ID,AgeRating, Languages) %>%
  #  separate_rows(Languages, sep=",") %>%
  #  drop_na(Languages) %>%
  #  group_by(ID,AgeRating) %>%
  #  summarise(numberOfLanguages = n())
 #  arrange(desc(numberOfLanguages))


 appstoreGamesDF %>%
  select(ID,AgeRating, Languages) %>%
  separate_rows(Languages, sep=",") %>%
  drop_na(Languages) %>%
  group_by(ID,AgeRating) %>%
  summarise(numberOfLanguages = n()) %>%
  #ungroup %>%
  #group_by(AgeRating) %>%
  #summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
    ggplot(aes(x=AgeRating, y=numberOfLanguages)) +
    geom_boxplot() +
    geom_jitter(width = 0.3) +
    #geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    coord_cartesian(ylim=c(0,90)) +
    xlab("Age Rating") +
    ylab("Number of Supported Languages") +
    theme_minimal()
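
When the boxes are too compressed to read, printing the per-group medians can complement the plot. The base R sketch below does this on made-up rows; the ", " separator inside Languages is an assumption for the example.

```r
# Hypothetical games with an age rating and a comma-separated language list.
games <- data.frame(
  AgeRating = c("4+", "4+", "4+", "9+", "9+", "17+"),
  Languages = c("EN", "EN, ZH", "EN", "EN, FR, DE", "EN", "EN, JA"),
  stringsAsFactors = FALSE
)

# Count the languages per game by splitting each string.
games$numberOfLanguages <- lengths(strsplit(games$Languages, ",\\s*"))

# Median number of languages within each age rating.
median_languages <- tapply(games$numberOfLanguages, games$AgeRating, median)
print(median_languages)
```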

Is there a correlation between age rating and the number of genres?

Similarly to the previous plot, the best way I found to visualize this question is by plotting points for each game. In this case, it seems that age rating has no influence on the number of genres. The reduced number of points in higher age ratings probably reflects the smaller number of games in each of those ratings.

 appstoreGamesDF %>%
  select(ID,AgeRating, Genres) %>%
  separate_rows(Genres, sep=",") %>%
  drop_na(Genres) %>%
  group_by(ID,AgeRating) %>%
  summarise(numberOfGenres = n()) %>%
  #summarise(averageNumberOfLanguages = mean(numberOfLanguages)) %>%
    ggplot(aes(x=AgeRating, y=numberOfGenres)) +
    geom_boxplot() +
    geom_jitter(width = 0.3) +
    #geom_text(aes(label=averageNumberOfLanguages), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    xlab("Age Rating") +
    ylab("Total Number of Genres") +
    theme_minimal()

Is there a correlation between age rating and average user rating?

To compare user ratings across age rating categories, I counted the games at each average user rating level within each age rating and then divided that count by the total number of rated games in the same age rating. The resulting proportions are displayed in the stacked bar chart below.

appstoreGamesDF %>%
  drop_na(AverageUserRating) %>%
  arrange(AverageUserRating) %>%
  pull(AverageUserRating) %>%
  unique() -> AverageUserRatingLevels #Get a vector containing all possible user rating levels in sequential order.

appstoreGamesDF %>%
  select(ID,AgeRating, AverageUserRating) %>%
  drop_na(AverageUserRating) %>%
  mutate(AverageUserRating = factor(AverageUserRating, levels = AverageUserRatingLevels)) %>%
  group_by(AgeRating, AverageUserRating) %>%
  summarise(count = n()) %>%
  mutate(freq = count / sum(count)) %>%
    ggplot(aes(x=reorder(AgeRating,desc(AgeRating)), y=freq, fill=AverageUserRating)) + 
    geom_col(position = position_stack(reverse = TRUE))  +
    scale_fill_brewer(palette = "RdYlGn") +
    geom_text(aes(label=count), size=4 ,position=position_stack(vjust = .5, reverse = TRUE)) + 
    theme_minimal() +
    xlab("Age Rating") +
    ylab("Proportion") +
    labs(fill="Average\nUser Rating") +
    coord_flip()
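
The within-group proportion used for the bar heights can be checked on a small example. The base R sketch below uses made-up rows and prop.table() with margin = 1, which divides each cell by its row total, so the shares inside each age rating sum to 1.

```r
# Hypothetical rated games.
ratings <- data.frame(
  AgeRating = c("4+", "4+", "4+", "4+", "9+", "9+"),
  AverageUserRating = c(4.5, 4.5, 4.0, 5.0, 4.5, 3.0)
)

# Rows are age ratings, columns are average user rating levels.
counts <- table(ratings$AgeRating, ratings$AverageUserRating)

# Divide each row by its own total to get per-age-rating proportions.
freq <- prop.table(counts, margin = 1)
print(round(freq, 2))
```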

Linear Regressions

There aren’t many possible variables to test for linear regression.

  • Strings
    • URL
    • Name
    • Subtitle
    • Icon URL
    • Description
    • Developer
  • Categorical Data
    • ID: Unordered
    • Age Rating: Unordered
    • Primary Genre: Unordered
    • Genres: Multiple Values, Unordered
    • Languages: Multiple Values, Unordered
    • Average User Rating: Ordered
  • Continuous data
    • User Rating Count
    • Price
    • Size
    • In-app Purchases: Multiple Values
    • Date
      • Original Release Date
      • Current Version Release Date

Any unordered data is unsuitable for linear regression. In-app Purchases is also not a good variable to analyze, since it is a list of prices that can have any number of elements and can contain repeated values.

The most likely candidates for a suitable linear regression are Price, User Rating Count, Average User Rating and Size. Starting with the first two:

Linear Regression of Price and the Number of User Ratings

No significant linear relationship between User Rating Count and Price was found (p = 0.332 for the Price coefficient, R-squared of about 0.0001).

reg <- lm(data=appstoreGamesDF %>%  drop_na(Price, UserRatingCount), UserRatingCount~Price)
par(mfrow=c(2,2))
plot(reg)

par(mfrow=c(1,1))
summary(reg)
## 
## Call:
## lm(formula = UserRatingCount ~ Price, data = appstoreGamesDF %>% 
##     drop_na(Price, UserRatingCount))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##   -3413   -3403   -3341   -2780 3029316 
## 
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)    
## (Intercept)   3418.1      500.2   6.834 0.00000000000889 ***
## Price         -195.3      201.5  -0.969            0.332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42320 on 7559 degrees of freedom
## Multiple R-squared:  0.0001243,  Adjusted R-squared:  -7.97e-06 
## F-statistic: 0.9397 on 1 and 7559 DF,  p-value: 0.3324
appstoreGamesDF %>%
  drop_na(Price, UserRatingCount) %>%
  select(Price, UserRatingCount) %>%
  ggplot(aes(x=Price, y=UserRatingCount))+
  geom_point() + 
  geom_smooth(method="lm") +
  coord_cartesian(ylim = c(-25000,100000)) +
  theme_minimal()
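
The figures that decide significance can be pulled out of the model summary programmatically instead of being read off the printed table. The sketch below fits the same kind of model on simulated data with no built-in relationship; the data and the object names are hypothetical.

```r
set.seed(42)
# Hypothetical data: prices and rating counts drawn independently of each other.
toy <- data.frame(
  Price = runif(200, min = 0, max = 10),
  UserRatingCount = rexp(200, rate = 1 / 3000)
)

fit <- lm(UserRatingCount ~ Price, data = toy)
fit_summary <- summary(fit)

# p-value of the slope and the share of variance explained by the model.
slope_p_value <- fit_summary$coefficients["Price", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared
print(c(p_value = slope_p_value, r_squared = r_squared))
```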

Linear Regression of Price and the Average User Rating

No significant linear relationship between Average User Rating and Price was found (p = 0.971 for the Average User Rating coefficient).

reg <- lm(data=appstoreGamesDF %>%  drop_na(Price, AverageUserRating), Price~AverageUserRating)
par(mfrow=c(2,2))
plot(reg)

par(mfrow=c(1,1))
summary(reg)
## 
## Call:
## lm(formula = Price ~ AverageUserRating, data = appstoreGamesDF %>% 
##     drop_na(Price, AverageUserRating))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -0.575  -0.571  -0.571  -0.570 139.419 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.576715   0.152703   3.777  0.00016 ***
## AverageUserRating -0.001332   0.036976  -0.036  0.97126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.416 on 7559 degrees of freedom
## Multiple R-squared:  1.717e-07,  Adjusted R-squared:  -0.0001321 
## F-statistic: 0.001298 on 1 and 7559 DF,  p-value: 0.9713
appstoreGamesDF %>%
  drop_na(Price, AverageUserRating) %>%
  select(Price, AverageUserRating) %>%
  ggplot(aes(x=AverageUserRating, y=Price))+
  geom_jitter(width=0.15) + 
  geom_smooth(method="lm") +
  coord_cartesian(ylim = c(0,60)) +
  theme_minimal()
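
Because Average User Rating takes only a few ordered values, a rank-based measure is a possible alternative to the linear fit. The sketch below uses Spearman's rank correlation, a technique that is not part of the analysis above, on simulated data; the simulated values are hypothetical and drawn independently of each other.

```r
set.seed(7)
# Hypothetical ratings on the half-star scale and a few typical price points.
toy <- data.frame(
  AverageUserRating = sample(seq(1, 5, by = 0.5), 300, replace = TRUE),
  Price = sample(c(0, 0.99, 1.99, 4.99), 300, replace = TRUE)
)

# exact = FALSE avoids the warning about exact p-values in the presence of ties.
rank_test <- cor.test(toy$Price, toy$AverageUserRating,
                      method = "spearman", exact = FALSE)

rho <- unname(rank_test$estimate)
print(c(rho = rho, p_value = rank_test$p.value))
```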

Linear Regression of Price and the App Size

There is a statistically significant linear regression between the price of a game and its size in bytes (p = 0.0000126). This seems plausible: bigger games may have a higher price due to the effort spent creating all that data. However, most games are free and earn their revenue through in-app purchases. Also, non-free games usually have standardized pricing. As a result, the regression has a very small slope (about 1.3e-9 per byte) and explains only about 0.1% of the variance in price (R-squared = 0.0011).

reg <- lm(data=appstoreGamesDF %>%  drop_na(Price, Size), Price~Size)
par(mfrow=c(2,2))
plot(reg)

par(mfrow=c(1,1))
summary(reg)
## 
## Call:
## lm(formula = Price ~ Size, data = appstoreGamesDF %>% drop_na(Price, 
##     Size))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -5.483  -0.807  -0.715  -0.680 179.221 
## 
## Coefficients:
##                    Estimate      Std. Error t value             Pr(>|t|)    
## (Intercept) 0.6639718843973 0.0691516404318   9.602 < 0.0000000000000002 ***
## Size        0.0000000012965 0.0000000002968   4.368            0.0000126 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.832 on 16981 degrees of freedom
## Multiple R-squared:  0.001122,   Adjusted R-squared:  0.001064 
## F-statistic: 19.08 on 1 and 16981 DF,  p-value: 0.0000126
appstoreGamesDF %>%
  drop_na(Price, Size) %>%
  mutate(Size = Size/1000000) %>%
  select(Price, Size) %>%
  ggplot(aes(x=Size, y=Price))+
  geom_point() + 
  geom_smooth(method="lm") +
  xlab("Size (MB)") +
  theme_minimal()
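
To make the per-byte slope easier to read, it can be converted to larger units. The sketch below first rescales the Size coefficient reported in the summary above, then uses simulated data (hypothetical sizes and prices) to illustrate that fitting on megabytes instead of bytes multiplies the slope by 1e6 but leaves R-squared unchanged.

```r
# Size coefficient from the regression summary above (price per byte).
slope_per_byte <- 0.0000000012965
slope_per_mb <- slope_per_byte * 1e6   # price change per 1,000,000 bytes
slope_per_gb <- slope_per_byte * 1e9   # price change per 1,000,000,000 bytes

# Hypothetical data to show the effect of rescaling the predictor.
set.seed(3)
toy <- data.frame(Size = runif(100, min = 1e6, max = 4e9))
toy$Price <- 0.5 + 1e-9 * toy$Size + rnorm(100)

fit_bytes <- lm(Price ~ Size, data = toy)
fit_mb <- lm(Price ~ I(Size / 1e6), data = toy)

slope_ratio <- unname(coef(fit_mb)[2] / coef(fit_bytes)[2])
print(c(per_mb = slope_per_mb, per_gb = slope_per_gb, ratio = slope_ratio))
```

Read this way, the reported coefficient corresponds to roughly 1.3 price units per additional gigabyte, which is small next to the residual standard error of 7.832.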

Other questions to explore

There were other questions that I thought could have been analyzed or plotted: